An Efficient way of Record Linkage System and Deduplication using Indexing techniques, Classification and FEBRL Framework
Author
Abstract
Record linkage is an important process in data integration: it is used for matching, merging and removing duplicates across several databases whose records refer to the same entities. Deduplication is the process of removing duplicate records within a single database. In recent years, data cleaning and standardization have become important steps in data mining tasks. Because of the complexity of today's databases, finding matching records within a single database is a crucial task. Indexing techniques are used to implement record linkage and deduplication efficiently. In this paper, three indexing techniques, namely blocking indexing, sorting indexing and bigram indexing, are used, together with a modification of existing techniques that reduces the variance in the quality of the blocking results. In addition to the indexing techniques, six comparison techniques and two classifiers are used. Using indexing together with comparison and classification techniques offers the potential for large performance speed-ups and better accuracy.
Keywords: record linkage, indexing techniques, data matching, blocking, Febrl framework
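The role of indexing in deduplication can be illustrated with a minimal Python sketch. It is not the paper's implementation and does not use the Febrl API: the function names, the blocking key (first letter of surname plus postcode) and the sample records are all hypothetical, and the sorting-based index is assumed here to be a sliding window over records sorted by key. The sketch only shows how an index cuts the number of candidate record pairs below the full set of pairwise comparisons.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first letter of surname plus postcode."""
    return record["surname"][:1].lower() + record["postcode"]


def block_pairs(records):
    """Standard blocking: pair only records that share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[blocking_key(record)].append(rec_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def sorted_window_pairs(records, window=2):
    """Sorting-based index (assumed form): sort by key, then pair each
    record with the next (window - 1) records in the sorted order."""
    ordered = sorted(records, key=lambda rid: (blocking_key(records[rid]), rid))
    pairs = set()
    for i, rec_id in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            pairs.add(tuple(sorted((rec_id, other))))
    return pairs


# Made-up sample records for illustration only.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smyth", "postcode": "2600"},
    3: {"surname": "Jones", "postcode": "2913"},
    4: {"surname": "Smith", "postcode": "2601"},
}

# Without an index, every record is compared with every other record.
all_pairs = len(records) * (len(records) - 1) // 2

print("all pairs:", all_pairs)
print("blocking candidates:", sorted(block_pairs(records)))
print("window candidates:", sorted(sorted_window_pairs(records, window=2)))
```

In this sketch, exact-key blocking misses the pair of records 1 and 4 because their postcodes differ by one digit, whereas a wider window in the sorting-based index would pick it up. This sensitivity to the choice of key is one reason several indexing techniques are typically compared.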
Similar resources
Febrl – A Freely Available Record Linkage System with a Graphical User Interface
Record or data linkage is an important enabling technology in the health sector, as linked data is a cost-effective resource that can help to improve research into health policies, detect adverse drug reactions, reduce costs, and uncover fraud within the health system. Significant advances, mostly originating from data mining and machine learning, have been made in recent years in many areas of ...
A Probabilistic Deduplication, Record Linkage and Geocoding System
In many data mining projects in the health sector information from multiple data sources needs to be cleaned, deduplicated and linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient. Most of the time the linkage process is challenged by the lack of a common unique entity identifier. Additionally, personal ...
Probabilistic Deduplication, Record Linkage and Geocoding
Outline: Background and illustrative example; Record linkage; Applications, privacy and ethics; Our project and our tools; Data cleaning and standardisation; Probabilistic data standardisation and HMMs; Blocking / indexing; Record pair classification; Geocoding; Outlook. Peter Christen, May 2005
A proficient cost reduction framework for de-duplication of records in data integration
BACKGROUND Record de-duplication is a process of identifying the records referring to the same entity. It has a pivotal role in data mining applications, which involves the integration of multiple data sources and data cleansing. It has been a challenging task due to its computational complexity and variations in data representations across different data sources. Blocking and windowing are the...
Probabilistic Record Linkage and Deduplication after Indexing, Blocking, and Filtering
Probabilistic record linkage, the task of merging two or more databases in the absence of a unique identifier, is a perennial and challenging problem. It is closely related to the problem of deduplicating a single database, which can be cast as linking a single database against itself. In both cases the number of possible links grows rapidly in the size of the databases under consideration, and...
Journal title:
Volume, issue:
Pages: -
Publication date: 2013